Abstract. Hashing with linear probing dates back to the 1950s, and is among the most studied 
algorithms. In recent years it has become one of the most important hash table organizations since it uses 
the cache of modern computers very well. Unfortunately, previous analyses rely either on complicated 
and space consuming hash functions, or on the unrealistic assumption of free access to a truly random 
hash function. Already Carter and Wegman, in their seminal paper on universal hashing, raised the 
question of extending their analysis to linear probing. However, we show in this paper that linear 
probing using a pairwise independent family may have expected logarithmic cost per operation. On 
the positive side, we show that 5-wise independence is enough to ensure constant expected time per 
operation. This resolves the question of finding a space and time efficient hash function that provably 
ensures good performance for linear probing. 

1 Introduction

Hashing with linear probing is perhaps the simplest algorithm for storing and accessing a set of 
keys that obtains nontrivial performance. Given a hash function h, a key x is inserted in an array 
by searching for the first vacant array position in the sequence h(x), h(x) + 1, h(x) + 2, . . . (Here, 
addition is modulo r, the size of the array.) Retrieval of a key proceeds similarly, until either 
the key is found, or a vacant position is encountered, in which case the key is not present in the 
data structure. Linear probing dates back to 1954, but was first analyzed by Knuth in a 1963 
memorandum [8] now considered to be the birth of the area of analysis of algorithms [10]. Knuth's 
analysis, as well as most of the work that has since gone into understanding the properties of 
linear probing, is based on the assumption that h is a truly random function. In 1977, Carter and 
Wegman's notion of universal hashing [3] initiated a new era in the design of hashing algorithms, 
where explicit and efficient ways of choosing hash functions replaced the unrealistic assumption of 
complete randomness. In their seminal paper, Carter and Wegman state it as an open problem to 
"Extend the analysis to [...] double hashing and open addressing." 1 

1.1 Previous results using limited randomness 

The first analysis of linear probing relying only on limited randomness was given by Siegel in [11, 
12]. Specifically, he shows that O(log n)-wise independence is sufficient to achieve essentially the 
same performance as in the fully random case. However, another paper by Siegel [13] shows that 
evaluation of a hash function from an O(log n)-wise independent family requires time $\Omega(\log n)$ unless 
the space used to describe the function is $n^{\Omega(1)}$. A family of functions is given that achieves space 
usage $n^{\epsilon}$ and constant time evaluation of functions, for any $\epsilon > 0$. However, this result is only 
of theoretical interest since the associated constants are very large (and growing exponentially 
with $1/\epsilon$). 



1 Nowadays the term "open addressing" refers to any hashing scheme where the data structure is an array containing 
only keys and empty locations. However, Knuth used the term to refer to linear probing in [8], and since it is 
mentioned here together with the double hashing probe sequence, we believe that it refers to linear probing. 



A potentially more practical method due to Dietzfelbinger (seemingly described in the literature 
only as a "personal communication" in [5]) can be used to achieve characteristics similar to those 
of linear probing, still using space $n^{\epsilon}$. This method splits the problem into many subproblems of 
roughly the same size, and simulates full randomness on each part. Thus, the resulting solution 
would be a collection of linear probing hash tables. 

A significant drawback of both methods above, besides a large number of instructions for function 
evaluation, is the use of random accesses to the hash function description. The strength of 
linear probing is that for many practical parameters, almost all lookups will incur only a single 
cache miss. Performing random accesses while computing the hash function value may destroy this 
advantage. 

1.2 Our results 

We show in this paper that linear probing using a pairwise independent family may have expected 
logarithmic cost per operation. Specifically, we resolve the open problem of Carter and Wegman by 
showing that linear probing insertion of n keys in a table of size 2n using a function of the form 
$x \mapsto ((ax + b) \bmod p) \bmod 2n$, where p = 4n + 1 is prime and we randomly choose $a \in [p] \setminus \{0\}$ 
and $b \in [p]$, requires $\Omega(n \log n)$ insertion steps in expectation for a worst case insertion sequence 
(chosen independently of a and b). Since the total insertion cost equals the total cost of looking up 
all keys, the expected average time to look up a key in the resulting hash table is $\Omega(\log n)$. The 
main observation behind the proof is that if a is the multiplicative inverse (modulo p) of a small 
integer m, then inserting a certain set that consists of two intervals has expected cost $\Omega(n^2/m)$. 

On the positive side, we show that 5-wise independence is enough to ensure constant amortized 
expected time per insertion, for load factor $\alpha = n/r$ bounded away from 1. This implies that the 
expected average time for a successful search is constant. We also show a constant expected time 
bound for unsuccessful searches. Our proof is based on a new way of bounding the cost of linear 
probing insertions, in terms of "fully loaded" intervals I, where the number of probe sequences 
starting in I is at least |I|. 

Our analysis of linear probing gives a bound of $n(1 + O(\frac{\alpha}{(1-\alpha)^2}))$ steps for n insertions. This 
implies a bound of $1 + O(\frac{1}{(1-\alpha)^2})$ steps on average for successful searches. For higher values of $\alpha$, 
these bounds are a factor $\Omega(\frac{1}{1-\alpha})$ higher than for linear probing with full independence. To get a 
better dependence on $\alpha$ we introduce a class of open addressing methods called blocked probing, 
which includes a special kind of bidirectional linear probing. For this scheme, which shares the cache 
friendliness of traditional linear probing, we get the same dependence on $\alpha$ (up to constant factors) 
as for full independence, again using only 5-wise independent hash functions. In addition, for blocked 
linear probing we bound the expected cost of any single insertion, deletion, or unsuccessful search, 
rather than the average cost of a sequence of operations. For successful searches, we analyze the 
average cost. In this case, we have a bound for 4-wise independent families as well. 

1.3 Significance 

Several recent experimental studies [2,6,9] have found linear probing to be clearly the fastest hash 
table organization for moderate load factors (30-70%). While linear probing operations are known 
to require more instructions than those of other open addressing methods, the fact that they access 
an interval of array entries means that linear probing works very well with modern architectures 
for which sequential access is much faster than random access (assuming that the elements we 
are accessing are each significantly smaller than a cache line, or a disk block, etc.). However, the 
hash functions used to implement linear probing in practice are heuristics, and there is no known 
theoretical guarantee on their performance. Since linear probing is particularly sensitive to a bad 
choice of hash function, Heileman and Luo [6] advise against linear probing for general-purpose 
use. Our results imply that simple and efficient hash functions, whose description can be stored in 
CPU registers, can be used to give provably good performance. 

2 Preliminaries 

2.1 Notation and definitions 

Define $[x] = \{0, 1, \dots, x - 1\}$. Throughout this paper S denotes a subset of some universe U, and 
h will denote a function from U to R = [r]. We denote the elements of S by $\{x_1, x_2, \dots, x_n\}$, and 
refer to the elements of S as keys. We let $n = |S|$, and $\alpha = n/r$. For any integers x and a define 
$x \ominus a = x - (x \bmod a)$. The function $x \mapsto x - \lfloor x \rfloor$ is denoted by frac(x). 

A family $\mathcal{H}$ of functions from U to R is k-wise independent if for any k distinct elements 
$x_1, \dots, x_k \in U$ and h chosen uniformly at random from $\mathcal{H}$, the random variables $h(x_1), \dots, h(x_k)$ 
are independent. We say that $\mathcal{H}$ is $\epsilon$-approximately uniform if function values for random $h \in \mathcal{H}$ 
have $L_\infty$ distance at most $\epsilon$ from the uniform distribution, i.e., for any $x \in U$ and any $y \in [r]$ it holds 
that $|\Pr\{h(x) = y\} - 1/r| \le \epsilon$. We note that in some papers, the notion of k-wise independence is 
stronger in that it is required that function values are uniform on [r]. However, many interesting 
k-wise independent families have a slightly nonuniform distribution, and we will provide analysis 
for such families as well. 

Let Q be a subset of the range R. By Q + a we denote the translated set $\{(a + y) \bmod r \mid y \in Q\}$. 
We will later use sets of form Q + h(x), for a fixed x and Q being an interval. An interval (modulo r) 
is a set of the form [b] + a, for integers a and b. 

Here we introduce a function which simplifies statements of some upper bounds. Define function 
T(a, e) with domain a G [0, 1), e G [0, ^j 2 ), and the value given by 

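$$T(\alpha, \epsilon) = \min\left\{ \frac{5.2\,\alpha(1+\epsilon)^2}{(1-(1+\epsilon)\alpha)^2} + \frac{4}{9\alpha} - 1 ,\ \ \frac{3\alpha^2(1+\epsilon)^2}{(1-(1+\epsilon)\alpha)^4}\left(2 + \frac{4}{9\alpha}\right) \right\} .$$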

Remark that T(a, e) = 0(^±§£). 

2.2 Hash function families 

Alon et al. [1] observed that the family of degree k - 1 polynomials in any finite field is k-wise 
independent. Specifically, for any prime p we may use the field defined by arithmetic modulo p to 
get a family of functions from [p] to [p] where a function can be evaluated in time O(k) on a RAM, 
assuming that addition and multiplication modulo p can be performed in constant time. To obtain 
a smaller range R = [r] we may map integers in [p] down to R by a modulo r operation. This of 
course preserves independence, but the family is now only close to uniform. Specifically, it has 
distance less than 1/p from the uniform distribution on [r]. 
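
The polynomial construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `make_poly_hash`, `poly_hash_from_coeffs` and `is_k_wise_independent` are hypothetical and not from the paper, and the brute-force independence check enumerates all $p^k$ coefficient vectors, so it is usable only for very small p.

```python
import itertools
import random


def poly_hash_from_coeffs(coeffs, p, r):
    """Return h(x) = ((c[0] + c[1]*x + ... + c[k-1]*x**(k-1)) mod p) mod r."""
    def h(x):
        acc = 0
        for c in reversed(coeffs):          # Horner's rule: O(k) operations mod p
            acc = (acc * x + c) % p
        return acc % r
    return h


def make_poly_hash(k, p, r, rng=random):
    """Draw a random polynomial of degree at most k-1 over arithmetic modulo the prime p."""
    coeffs = [rng.randrange(p) for _ in range(k)]
    return poly_hash_from_coeffs(coeffs, p, r)


def is_k_wise_independent(k, p, points):
    """Brute-force check (before the final mod r) on distinct points of [p].

    Requires len(points) <= k. Every tuple of values in [p]^len(points)
    must be produced by exactly p**(k - len(points)) coefficient vectors.
    """
    if len(points) > k:
        raise ValueError("need at most k points")
    counts = {}
    for coeffs in itertools.product(range(p), repeat=k):
        h = poly_hash_from_coeffs(coeffs, p, p)   # r = p: no range reduction
        key = tuple(h(x) for x in points)
        counts[key] = counts.get(key, 0) + 1
    expected = p ** (k - len(points))
    return len(counts) == p ** len(points) and set(counts.values()) == {expected}
```

For instance, `make_poly_hash(5, 2**61 - 1, 1024)` draws a function from a 5-wise independent family over the Mersenne prime $2^{61} - 1$ and reduces it to the range [1024]; the choice of prime and range here is an arbitrary example.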

A recently proposed k-wise independent family of Thorup and Zhang [14] has uniformly distributed 
function values in [r]. From a theoretical perspective (ignoring constant factors) it is 
inferior to Siegel's highly independent family [13], since the evaluation time depends on k. We mention 
it here because it is the first construction that makes k-wise independence truly competitive 
with popular heuristics, for small k > 2, in terms of evaluation time. The construction for 4-wise 
independence is particularly efficient in practice. Though this is not stated in [14], it is not hard 
to verify that the same construction in fact gives 5-wise independence, and thus our analysis will 
apply. 

2.3 Probability bounds for fully loaded intervals 

Here we state a lemma that is essential for our upper bound results, described in Section 4 and 
Section 5. The technical proof is deferred to the final section. 

Lemma 1. Let Ti be a 4-wise independent and --approximately uniform family of functions which 
map U to R, with e < 1 — ^. If h is chosen uniformly at random from Ti, then for any Q C R of 

size q, 

?*{\h(S) n Q\ > aq(l + e) + d)} < 3q V + + e) 2 
If the family of functions is 5-wise independent and e < min{l — ^, , then for any fixed 

xeU\S, 

Pr{\h(S) n (Q + h(x))\ > q)} < (3a 2 q- 2 + aq^) . 

(1 - (1 + e)a) 

3 Lower bound for pairwise independence 

Consider the following family of functions, introduced by Carter and Wegman [3] as a first example 
of a universal family of hash functions: 

$$\mathcal{H}(p, r) = \{x \mapsto ((ax + b) \bmod p) \bmod r \mid 0 < a < p,\ 0 \le b < p\}$$

where p is any prime number and r < p is any integer. Functions in $\mathcal{H}(p,r)$ map integers of [p] to 
[r]. We slightly modify $\mathcal{H}(p,r)$ to be pairwise independent and have uniformly distributed function 
values. Let $\hat{p} = \lceil p/r \rceil r$, and define a function g as follows: $g(y, \hat{y}) = \hat{y}$ if $\hat{y} \ge p$, and $g(y, \hat{y}) = y$ 
otherwise. For a vector v let $v_i$ denote the i + 1st component (indexes starting with zero). We define: 

$$\mathcal{H}^*(p, r) = \{x \mapsto g((ax + b) \bmod p, v_x) \bmod r \mid 0 \le a < p,\ 0 \le b < p,\ v \in [\hat{p}]^p\}$$

Our lower bound on the performance of linear probing will apply to both $\mathcal{H}$ and $\mathcal{H}^*$. This gives an 
example of a very commonly used hash function family that does not yield expected constant time 
per operation for linear probing for a fixed, worst case set. It also shows that pairwise independence 
is not a sufficient condition for a family to work well with linear probing, and thus complements 
our upper bounds for 5-wise (and higher) independence. 

Lemma 2 (Pairwise independence). For any pair of distinct values $x_1, x_2 \in [p]$, and any 
$y_1, y_2 \in [r]$, if h is chosen uniformly at random from $\mathcal{H}^*(p,r)$, then 

$$\Pr\{h(x_1) = y_1 \wedge h(x_2) = y_2\} = 1/r^2 .$$
Proof. We will show something stronger than claimed, namely that the family 

$$\mathcal{H}^{**} = \{x \mapsto g((ax + b) \bmod p, v_x) \mid 0 \le a < p,\ 0 \le b < p,\ v \in [\hat{p}]^p\}$$

is pairwise independent and has function values uniformly distributed in $[\hat{p}]$. Since r divides $\hat{p}$ this 
will imply the lemma. Pick any pair of distinct values $x_1, x_2 \in [p]$, and consider a random function 
$h \in \mathcal{H}^{**}$. Clearly, $v_{x_1}$ and $v_{x_2}$ are uniform in $[\hat{p}]$ and independent. Also, it follows by standard 
arguments [3] that $(ax_1 + b) \bmod p$ and $(ax_2 + b) \bmod p$ are uniform in [p] and independent. We 
can think of the definition of h(x) as follows: The value is $v_x$ unless $v_x \in [p]$, in which case we 
substitute $v_x$ for another random value in [p], namely $(ax + b) \bmod p$. It follows that hash function 
values are uniformly distributed, and pairwise independent. □ 

To lower bound the cost of linear probing we use the following lemma: 

Lemma 3. Suppose that n keys are inserted in a linear probing hash table of size r with probe 
sequences starting at $i_1, \dots, i_n$, respectively. Further, suppose that $I_1, \dots, I_\ell$ is any set of intervals 
(modulo r) such that we have the multiset equality $\bigcup_j \{i_j\} = \bigcup_j I_j$. Then the total number of steps 
to perform the insertions is at least 

$$\sum_{1 \le j_1 < j_2 \le \ell} |I_{j_1} \cap I_{j_2}|^2 / 2 .$$

Proof. We proceed by induction on $\ell$. Since the number of insertion steps is independent of the order 
of insertions, we may assume that the insertions corresponding to $I_\ell$ occur last. By the induction 
hypothesis, the total number of steps to do all preceding insertions is at least 
$\sum_{1 \le j_1 < j_2 \le \ell - 1} |I_{j_1} \cap I_{j_2}|^2 / 2$. Let $S_j$ denote the set of keys corresponding to $I_j$. For any $j < \ell$, and 
any $x \in S_\ell$ with probe sequence starting in $I_j \cap I_\ell$, the insertion of x will pass all keys of $S_j$ with 
probe sequences starting in $I_j \cap I_\ell$. This means that at least $|I_j \cap I_\ell|^2 / 2$ steps are used during the 
insertion of the keys of $S_\ell$ to pass locations occupied by keys of $S_j$. Summing over all $j < \ell$ and 
adding to the bound for the preceding insertions finishes the induction step. □ 
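
The order independence used in the proof can be observed with a minimal linear probing sketch. The names `insert_all` and `lookup` are illustrative, not from the paper, and the sketch assumes fewer keys than table slots. Here `steps` counts only moves past occupied slots, so it differs from a count that includes the first probe of each insertion by exactly n.

```python
def insert_all(keys, h, r):
    """Insert keys into an empty linear probing table of size r (len(keys) < r).

    Returns (table, steps), where steps is the total number of occupied
    slots passed over during all insertions.
    """
    table = [None] * r
    steps = 0
    for x in keys:
        pos = h(x) % r
        while table[pos] is not None:   # walk cyclically to the first vacant slot
            pos = (pos + 1) % r
            steps += 1
        table[pos] = x
    return table, steps


def lookup(table, h, x):
    """Follow the probe sequence of x until x or a vacant slot is found."""
    r = len(table)
    pos = h(x) % r
    for _ in range(r):
        if table[pos] is None:
            return False
        if table[pos] == x:
            return True
        pos = (pos + 1) % r
    return False
```

Inserting the same key set in different orders generally produces different tables, but the same value of `steps`.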

Theorem 1. For $r = \lceil p/2 \rceil$ there exists a set $S \subseteq [p]$, $|S| \le r/2$, such that the expected cost of 
inserting the elements of S in a linear probing hash table of size r using a hash function chosen 
uniformly at random from $\mathcal{H}(p,r)$ is $\Omega(r \log r)$. 

Proof. We first define S as a random variable, and show that when choosing h at random from 
$\mathcal{H}(p,r)$ the expected total insertion cost for the keys of S is $\Omega(r \log r)$. This implies the existence of 
a fixed set S with at least the same expectation for random $h \in \mathcal{H}(p,r)$. Specifically, we subdivide 
[p] into 8 intervals $U_1, \dots, U_8$, such that $\bigcup_i U_i = [p]$ and $r/4 \ge |U_i| \ge r/4 - 1$ for i = 1, . . . , 8, and 
let S be the union of two of the sets $U_1, \dots, U_8$ chosen at random (without replacement). Note that 
$|S| \le r/2$, as required. 

Consider a particular function $h \in \mathcal{H}(p,r)$ and the associated values of a and b. Let $\hat{h}(x) = 
(ax + b) \bmod p$, and let m denote the unique integer in [p] such that $am \bmod p = 1$ (i.e., $m = a^{-1}$ 
in GF(p)). Since $\hat{h}$ is a permutation on [p], the sets $\hat{h}(U_i)$, i = 1, . . . , 8, are disjoint. We note that 
for any x, $\hat{h}(x + m) = (\hat{h}(x) + 1) \bmod p$. Thus, for any k, $\hat{h}(\{x, x + m, x + 2m, \dots, x + km\})$ is an 
interval (modulo p) of length k + 1. This implies that for all i there exists a set $\hat{L}_i$ of m disjoint 
intervals such that $\hat{h}(U_i) = \bigcup_{I \in \hat{L}_i} I$. Similarly, for all i there exists a set $L_i$ of at most m + 1 intervals 
(not necessarily disjoint) such that we have the multiset equality $h(U_i) = \bigcup_{I \in L_i} I$. Since all intervals 
in $\bigcup_i \hat{L}_i$ are disjoint, an interval in $\bigcup_i L_i$ can intersect at most two other intervals in $\bigcup_i L_i$. We now 
consider two cases: 

1. Suppose there is some i such that $\sum_{I_1, I_2 \in L_i,\ I_1 \ne I_2} |I_1 \cap I_2| \ge r/16$. Then with constant 
probability $U_i \subseteq S$, and we apply the bound of Lemma 3. The sum is minimized if all O(m) nonzero 
intersections have the same size, $\Omega(r/m)$. Thus Lemma 3 implies that the number of insertion steps 
is $\Omega(r^2/m)$. 

2. Now suppose that for all i, $\sum_{I_1, I_2 \in L_i,\ I_1 \ne I_2} |I_1 \cap I_2| < r/16$. Note that any value in [r - 1] 
is contained in exactly two intervals of $\bigcup_i L_i$, and by the assumption at most half occur in two 
intervals of $L_i$ for some i. Thus there exist $i_1, i_2$, $i_1 \ne i_2$, such that $|h(U_{i_1}) \cap h(U_{i_2})| = \Omega(r)$. With 
constant probability we have $S = U_{i_1} \cup U_{i_2}$. We now apply Lemma 3. Consider just the terms in 
the sum of the form $|I_1 \cap I_2|^2 / 2$, where $I_1 \in L_{i_1}$ and $I_2 \in L_{i_2}$. As before, this sum is minimized if 
all O(m) intersections have the same size, $\Omega(r/m)$, and we derive an $\Omega(r^2/m)$ lower bound on the 
number of insertion steps. 

For a random $h \in \mathcal{H}(p,r)$, m is uniformly distributed in $\{1, \dots, p - 1\}$ (the map $a \mapsto a^{-1}$ is 
a permutation of $\{1, \dots, p - 1\}$). Therefore, the expected total insertion cost is 
$\frac{1}{p-1} \sum_{m=1}^{p-1} \Omega(r^2/m) = \Omega(r^2 \log p / p) = \Omega(r \log r)$. □ 
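
The structural fact driving the proof, namely that stepping the key by $m = a^{-1}$ advances $(ax + b) \bmod p$ by exactly one, is easy to check numerically. The sketch below is illustrative only: the helper names `h_hat`, `cyclic_runs` and `image_runs` are hypothetical, and the modular inverse via `pow(m, -1, p)` assumes Python 3.8 or later.

```python
def h_hat(a, b, p, x):
    """The inner affine map x -> (a*x + b) mod p."""
    return (a * x + b) % p


def cyclic_runs(values, p):
    """Number of maximal runs of consecutive residues modulo p in the set `values`."""
    s = set(values)
    # A run starts at v exactly when its cyclic predecessor is absent.
    return sum(1 for v in s if (v - 1) % p not in s)


def image_runs(a, b, p, lo, hi):
    """Number of intervals (modulo p) that the image of the key interval [lo, hi) forms."""
    return cyclic_runs([h_hat(a, b, p, x) for x in range(lo, hi)], p)
```

With p = 101 and m = 3, the multiplier is `a = pow(3, -1, 101)`, and the image of any interval of keys splits into at most m intervals modulo p, which is what makes small m costly for linear probing.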

Corollary 1. Theorem 1 holds also if we replace $\mathcal{H}(p,r)$ by $\mathcal{H}^*(p,r)$. In particular, pairwise 
independence is not a sufficient condition for linear probing to have expected constant cost per operation. 

Proof. Consider the parameters a, b, and v of a random function in $\mathcal{H}^*(p,r)$. Since $r = \lceil p/2 \rceil$ we 
have $\hat{p} = p + 1$, and $(p/\hat{p})^p > 1/4$. Therefore, with constant probability it holds that $a \ne 0$ and 
$v \in [p]^p$. Restricted to functions satisfying this, the family $\mathcal{H}^*(p,r)$ is identical to $\mathcal{H}(p,r)$. Thus, 
the lower bound carries over (with a smaller constant). By Lemma 2, $\mathcal{H}^*$ is pairwise independent 
with uniformly distributed function values. □ 

We remark that the lower bound is tight. A corresponding O(n log n) upper bound can be shown 
by applying the analysis of Section 4 and using Chebyshev's inequality to bound the probability of 
a fully loaded interval, rather than using the 4th moment inequality as in Lemma 1. 

4 Linear probing with 5-wise independence 

We analyze the cost of performing n insertions into an empty table of size $r = n/\alpha$. From this, a 
bound on the cost of a successful search (of a random element in the set) can be derived. The cost 
of insertions and the average search cost do not depend on the order of insertions - or equivalently, 
on the policy of placing elements being inserted. We assume that the following policy is in effect: 
if x is the new element to be inserted, place x into the first slot h(x) + i that is either empty or 
contains an element x' such that $h(x') \notin h(x) + [i + 1]$. If x is placed into a slot previously occupied 
by x' then the probe sequence continues as if x' is being inserted. The entire procedure terminates 
when an empty slot is found. 

Theorem 2. LetTL be a 5-wise independent and ^-approximately uniform family of functions which 
maps U to R, with e < min {l — ^, ^-^}. When linear probing is used with a hash function chosen 
uniformly at random from TL, the expected total number of probes made by a sequence ofn insertions 
into an empty table is less than n(l + T(a, e)). 
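Proof. For every $x_i \in S$, let $d_i$ be the displacement of $x_i$, i.e. the number such that $x_i$ resides in 
slot $(h(x_i) + d_i) \bmod r$. The entire cost of all insertions is equal to $\sum_{i=1}^{n} (1 + d_i)$. From the way 
elements are inserted, we conclude that, for $1 \le i \le n$ and $1 \le l \le d_i$, every interval $h(x_i) + [l]$ is 
fully loaded, meaning that at least l elements of $S \setminus \{x_i\}$ hash into it. Let $A_{il}$ be the event that the 
interval of slots $h(x_i) + [l]$ is fully loaded. Then, 

$$E(d_i) = \sum_{k=1}^{r} \Pr\{d_i \ge k\} \le \sum_{k=1}^{r} \Pr\Big(\bigcap_{l=1}^{k} A_{il}\Big) \le \sum_{j=0}^{\lfloor \lg r \rfloor} 2^j \cdot \Pr(A_{i2^j}) .$$

Lemma 1 gives us an upper bound on $\Pr(A_{i2^j})$. However, for small lengths and not small $\alpha$ the 
bound is useless, so then we will simply use the trivial upper bound of 1. Let $K = \frac{3\alpha^2(1+\epsilon)^2}{(1-(1+\epsilon)\alpha)^4}$. We 
first consider the case $K \ge 1$. Denoting $j_* = \lceil \frac{1}{2} \lg K \rceil$ we have 

$$\sum_{j=0}^{\lg r} 2^j \Pr(A_{i2^j}) < \sum_{j=0}^{j_*-1} 2^j + \sum_{j=j_*}^{\lg r} \frac{K}{2^j} + \sum_{j=j_*}^{\lg r} \frac{K}{3\alpha \cdot 4^j}$$

$$< 2^{j_*} - 1 + \frac{K}{2^{j_*}} \cdot \frac{1}{1 - \frac{1}{2}} + \frac{K}{3\alpha \cdot 4^{j_*}} \cdot \frac{1}{1 - \frac{1}{4}}$$

$$\le \sqrt{K} \cdot 2^{\lceil \frac{1}{2} \lg K \rceil - \frac{1}{2} \lg K} - 1 + 2\sqrt{K} \cdot 2^{-(\lceil \frac{1}{2} \lg K \rceil - \frac{1}{2} \lg K)} + \frac{4}{9\alpha} .$$

The last expression is not larger than $3\sqrt{K} + \frac{4}{9\alpha} - 1$, because $2^t + 2 \cdot 2^{-t} \le 3$, for $t \in [0,1]$. 
Doing an easier calculation without splitting of the sum at index $j_*$ gives $E(d_i) < K(2 + \frac{4}{9\alpha})$. The 
bound $3\sqrt{K} + \frac{4}{9\alpha} - 1$ is higher than the bound $K(2 + \frac{4}{9\alpha})$ when K < 1, and thus we can write 
$E(d_i) < T(\alpha, \epsilon)$. □ 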
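
The elementary inequality $2^t + 2 \cdot 2^{-t} \le 3$ on [0, 1] invoked in the proof can be verified directly. The substitution $u = 2^t$ below is an auxiliary device introduced here for the check and does not appear in the original argument.

```latex
u = 2^{t} \in [1, 2]: \qquad
u + \frac{2}{u} \le 3
\iff u^{2} - 3u + 2 \le 0
\iff (u - 1)(u - 2) \le 0 ,
```

which holds for every u in [1, 2], with equality exactly at the endpoints t = 0 and t = 1.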



5 Blocked probing 

In this section we propose and analyze a family of open addressing methods, containing among 
others a variant of bidirectional linear probing. Suppose that keys are hashed into a table of size r 
by a function h. For simplicity we assume that r is a power of two. Let $V^i_j = \{j, j + 1, \dots, j + 2^i - 1\}$ 
where j is assumed to be a multiple of $2^i$. Intervals $V^i_j$ may be thought of as sets of references 
to slots in the hash table. In a search for key x, intervals $V^i_j$ that enclose h(x) are examined in 
the order of increasing i. More precisely, $V^0_{h(x)}$ is examined first; if the search did not finish after 
traversing $V^i_{h(x) \ominus 2^i}$, then the search proceeds in the untraversed half of $V^{i+1}_{h(x) \ominus 2^{i+1}}$. The search stops 
after traversal of an interval if any of the following three cases holds: 

a) key x was found, 

b) the interval contained empty slot(s), 

c) the interval contained key(s) whose hash value does not belong to the interval. 

In case (a) the search may obviously stop immediately on discovery of x - there is no need to 
traverse through the rest of the interval. 

Traversal of unexamined halves of intervals $V^i_j$ may take different concrete forms - the only 
requirement is that every slot is probed exactly once. From a practical point of view, a good choice 
is to probe slots sequentially in a way that makes the scheme a variant of bidirectional linear 
probing. This concrete scheme defines a probe sequence that in probe numbers $2^i$ to $2^{i+1} - 1$ 
inspects either slots 

$$(h(x) \ominus 2^i + 2^i,\ h(x) \ominus 2^i + 2^i + 1,\ \dots,\ h(x) \ominus 2^i + 2^{i+1} - 1)$$

or $$(h(x) \ominus 2^i - 1,\ h(x) \ominus 2^i - 2,\ \dots,\ h(x) \ominus 2^i - 2^i)$$

depending on whether $h(x) \bmod 2^i = h(x) \bmod 2^{i+1}$ or not. A different probe sequence that falls 
in this class of methods, but is not sequential, is $(x, j) \mapsto h(x) \text{ xor } j$, with j starting from 0. 
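
Both probe sequences can be written down compactly. The following sketch assumes probes are numbered from 0 and that the table size r is a power of two; the function names are illustrative and not taken from the paper.

```python
def xor_probes(hx, r):
    """Probe sequence j -> hx xor j for j = 0, 1, ..., r - 1."""
    return [hx ^ j for j in range(r)]


def bidirectional_probes(hx, r):
    """Sequential variant: after the aligned block of size 2**i around hx has
    been scanned, scan the other half of the aligned block of size 2**(i+1)."""
    probes = [hx]
    size = 1                                   # size = 2**i
    while size < r:
        base = hx - hx % size                  # hx rounded down to a multiple of 2**i
        if hx % size == hx % (2 * size):       # hx lies in the lower half: scan upward
            probes.extend(range(base + size, base + 2 * size))
        else:                                  # hx lies in the upper half: scan downward
            probes.extend(range(base - 1, base - size - 1, -1))
        size *= 2
    return probes


def aligned_block(hx, size):
    """Set of slots in the aligned block of the given size that contains hx."""
    base = hx - hx % size
    return set(range(base, base + size))
```

In either sequence the first $2^i$ probes cover exactly the aligned block of size $2^i$ containing h(x). For example, with r = 16 and h(x) = 5 the sequential variant visits 5, 4, 6, 7, 3, 2, 1, 0, 8, ..., 15.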

Insertions. Until key x which is being inserted is placed in a slot, the same probe sequence is followed 
as in a search for x. However, x may be placed in a non-empty slot if its hash value is closer to the 
slot number in a special metric which we will now define (it is not hard to guess what kind of metric 
that should be for the above given search procedure to work). Let $d(y_1, y_2) = \min\{i \mid y_2 \in V^i_{y_1 \ominus 2^i}\}$. 
The value of $d(y_1, y_2)$ is equal to the position of the most significant bit in which $y_1$ and $y_2$ differ. 
If during insertion of x we encounter a slot y containing key x' then key x is put into slot y if 
d(h(x), y) < d(h(x'), y). In an implementation there is no need to evaluate d(h(x), y) values every 
time. We can keep track of what interval $V^i_{h(x) \ominus 2^i}$ is being traversed at the moment and check 
whether h(x') belongs to that interval. 

When x is placed in slot y which was previously occupied by x', a new slot for x' has to be found. 
Let i = d(h(x'), y). The procedure now continues as if x' is being inserted and we are starting with 
traversal of $V^i_{h(x') \ominus 2^i} \setminus V^{i-1}_{h(x') \ominus 2^{i-1}}$. If the variant of bidirectional linear probing is used, the traversal 
may start from position y, which may matter in practice. 
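
Since d is the position of the most significant differing bit, it can be computed with a single xor. The sketch below is an illustration under that reading; `d_by_definition` re-derives the value from aligned blocks, and `should_take_slot` is a hypothetical name for the placement test described above.

```python
def d(y1, y2):
    """Position (counted from 1 at the least significant bit) of the most
    significant bit in which y1 and y2 differ; 0 if y1 == y2."""
    return (y1 ^ y2).bit_length()


def d_by_definition(y1, y2):
    """Smallest i such that y2 lies in the aligned block of size 2**i containing y1."""
    i = 0
    while True:
        size = 1 << i
        base = y1 - y1 % size
        if base <= y2 < base + size:
            return i
        i += 1


def should_take_slot(h_new, h_old, y):
    """The incoming key takes slot y if its hash value is strictly closer to y in metric d."""
    return d(h_new, y) < d(h_old, y)
```

For example, d(5, 7) = 2 because 7 lies in the block {4, ..., 7} of size 4 around 5 but not in the block {4, 5} of size 2.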

Deletions. After removal of a key we have to check if the new empty slot can be used to bring some 
keys closer to their hash values, in terms of metric d. If there is an additional structure among 
stored elements, like in the bidirectional linear probing variant, some elements may be repositioned 
even though the corresponding values of metric d do not decrease. Let x be the removed key, y 
be the slot in which it resided, and i = d(h(x), y). There is no need to examine $V^{i-1}_{h(x) \ominus 2^{i-1}}$. If 
$V^i_{h(x) \ominus 2^i} \setminus V^{i-1}_{h(x) \ominus 2^{i-1}}$ contains another empty slot then the procedure does not continue in wider 
intervals. If it continues and an element gets repositioned then the procedure is recursively applied 
starting from the new empty slot. 

It is easy to formally check that appropriate invariants hold and that the above described set 
of procedures works correctly. 

5.1 Analysis 

We analyze the performance of operations on a hash table of size r when this class of probe 
sequences is used. Suppose that the hash table stores an arbitrary fixed set of $n = \alpha r$ elements. Let 
$C^U_\alpha$, $C^I_\alpha$, $C^D_\alpha$, and $C^S_\alpha$ be the random variables that respectively represent: the number of probes 
made during an unsuccessful search for a fixed key, the number of probes made during an insertion 
of a fixed key, the number of probes made during a deletion of a fixed key, and the number of 
probes made during a successful search for a random element from the set. In the symbols for the 
random variables we did not explicitly include marks for the fixed set and fixed elements which are 
used in the operations, but they have to be implied. The upper bounds on the expectations of 
variables, which are given by the following theorem, do not depend on choices of those elements. 
Theorem 3. Let $\mathcal{H}$ be a 5-wise independent and $\epsilon$-approximately uniform family of functions 
which map U to R, with e < minjl — ^, ^^}. For a load factor a < 1, blocked probing with a 
hash function chosen uniformly at random from Tt provides the following expectations: E(C^) < 
1 + T(a,e), E(C£) < l + 2T(a,e), E{C%) < l + 2T(a,e), and 

£(0S )</ 1 + ( a2 + S)l^S&F- ifa ^ 

l^ + i^ + ^-l + iln((l-(l+^) 1 »- 4 Wl+ e )) 8/9 ) ,if«>S ' 

Proof. Denote by x the fixed element from $U \setminus S$ that is being searched/inserted. Let $\hat{C}^U_\alpha$ be 
the random variable that takes value $2^i$ when $2^{i-1} < C^U_\alpha \le 2^i$, $0 \le i \le \lg r$. We can write 
$\hat{C}^U_\alpha = 1 + \sum_{i=1}^{\lg r} 2^{i-1} T_i$, where $T_i$ is an indicator variable whose value is 1 when at least $2^{i-1} + 1$ 
probes are made during the search. Let $A_j$ be the event that the interval of slots $V^j_{h(x) \ominus 2^j}$ is fully 
loaded, meaning that at least $2^j$ elements are hashed into the interval. Then $T_i = 1$ only when the chosen 
function h is in $\bigcap_{j=0}^{i-1} A_j$. We get an overestimate of $E(C^U_\alpha)$ with $E(\hat{C}^U_\alpha) \le 1 + \sum_{i=1}^{\lg r} 2^{i-1} \Pr(A_{i-1})$. 
The sum $\sum_{j=0}^{\lg r} 2^j \Pr(A_j)$ appeared in the proof of Theorem 2, so we reuse the upper bound found 
there. 
We now move on to analyzing insertions. Let be the random variable that takes value 2 % 
when 2*- 1 < < 2\ < i < lgr. Variable C% gives us the slot where x is placed, but we have 
to consider possible movements of other elements. If x is placed into a slot previously occupied 
by key x' from a "neighboring" interval $V^{i+1}_{h(x) \ominus 2^{i+1}} \setminus V^i_{h(x) \ominus 2^i}$, then as many as $2^i$ probes may be 
necessary to find a place for x' in $V^i_{h(x') \ominus 2^i}$, if there is one. If entire $V^{i+1}_{h(x) \ominus 2^{i+1}}$ is fully loaded, then 
as many as $2^{i+1}$ additional probes may be needed to find a place within $V^{i+2}_{h(x) \ominus 2^{i+2}} \setminus V^{i+1}_{h(x) \ominus 2^{i+1}}$, 
and so on. In general - and taking into account all repositioned elements - we use the following 
accounting to get an overestimate of $E(C^I_\alpha)$: for every fully loaded interval $V^i_{h(x) \ominus 2^i}$ we charge 
$2^i$ probes, and for every fully loaded neighboring interval $V^{i+1}_{h(x) \ominus 2^{i+1}} \setminus V^i_{h(x) \ominus 2^i}$ we also charge $2^i$ 
probes. The probability of a neighboring interval of length $2^i$ being full is equal to $\Pr(A_i)$. As a 
result, $E(C^I_\alpha) \le 1 + \sum_{i=0}^{\lg r - 1} 2^i \cdot 2\Pr(A_i) < 1 + 2T(\alpha, \epsilon)$. 

Reasoning for deletions is similar. 

A bound on E(C~.) can be derived from the bound on E(C^) by using the observation E(C^) < 
X^i^o 1 E(C-/ r ), as m nnear probing 2 . The solution to the equation t^t^^j^i = 1, over z £ [0, 1], is 
w j^. When substituting E(C* /r ) with l + 2T(£,e), we choose to use the second function from the 
expression for T(a, e) when i < j^r, and to use the first function for higher i (recall the discussion 
from the proof of Theorem 2). The obtained sums are suitable for approximation by integrals. After 
some technical work we end up with the claimed bound on E(C^). □ 

For higher values of $\alpha$, the dominant term in the upper bounds is $O((1 - \alpha)^{-2})$ (except for 
successful searches). The constant factors in front of the term $(1 - \alpha)^{-2}$ are relatively high compared 
to standard linear probing with fully random hash functions. This is in part due to the approximate 
nature of the proof of Theorem 3, and in part due to the tail bounds that we use, which are weaker than 
those for fully independent families. In the fully independent case, the probability that an interval 
of length q is fully loaded is less than $e^{q(1 - \alpha + \ln \alpha)}$ according to Chernoff-Hoeffding bounds [4,7]. 



2 The equality in the relation need not hold in general for blocked probing. It holds in the bidirectional linear probing 
variant. 
Plugging this bound into the proof of Theorem 3 would give, e.g., 

$$E(C^U_\alpha) < 1 + \frac{e^{1 - \alpha + \ln \alpha}}{\ln 2 \cdot |1 - \alpha + \ln \alpha|} . \qquad (1)$$

For $\alpha$ close to 1, a good upper bound on (1) is $1 + \frac{2}{\ln 2}(1 - \alpha)^{-2}$. The constant factor here is ≈ 2.88, 
as opposed to ≈ 5.2 from the statement of Theorem 3. As $\alpha$ gets smaller, the bound in (1) gets 
further below $1 + \frac{2}{\ln 2}(1 - \alpha)^{-2}$. 
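
The constant ≈ 2.88 can be traced with a short calculation, reading the exponent in (1) as $1 - \alpha + \ln \alpha$. The shorthand $\delta = 1 - \alpha$ is introduced here only for this check.

```latex
1 - \alpha + \ln \alpha
  = \delta + \ln(1 - \delta)
  = -\frac{\delta^{2}}{2} - \frac{\delta^{3}}{3} - \cdots
  \le -\frac{\delta^{2}}{2},
\qquad\text{so}\qquad
\frac{e^{1 - \alpha + \ln \alpha}}{\ln 2 \cdot |1 - \alpha + \ln \alpha|}
  \le \frac{1}{\ln 2 \cdot \delta^{2}/2}
  = \frac{2}{\ln 2}\,(1 - \alpha)^{-2}
  \approx 2.885\,(1 - \alpha)^{-2}.
```

The two sides agree to first order as $\alpha \to 1$, and the gap widens for smaller $\alpha$ because both the numerator $e^{1 - \alpha + \ln \alpha}$ and the dropped higher-order terms then matter.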

As we will show in the next theorem, 4-wise independence is sufficient to get good performance 
of successful searches. We will more directly bound the sum of displacements of all elements of S. 
However, if we tried to be very precise in calculating an upper bound, calculations would become 
complex and seemingly impossible to keep at the level of elementary functions (e.g. summing 
Yli ((i-a)2'+i)3 • Instead, we do a relatively simple calculation. A bound with concrete constants is 
given only for the case a > 0.8, as an illustration. 

Theorem 4. LetTL be a 4-wise independent and ^-approximately uniform family of functions which 

map U to R, with e < min { 1 — ^ , } . When blocked probing is used with a hash function chosen 

uniformly at random from 7i, then E(C^) < 1 + O(j^). If a> 0.8, an upper bound on E(C^) is 

6(l+£) 9 7 
l-(l+e)a ~ ZJ - 

Proof. It is sufficient to bound the expectation of Y17=i(^ + S'Lo 1 2 l Tu), where Tu is an indicator 
variable whose value is 1 if element Xi is not placed inside interval ^( x .)q 2 j- We will estimate 
E{Yli=i ^u)- F° r eas i er exposition, we introduce the symbol a = (1 + e)a. 

Let it; be a variable with domain [(1 — a)2 l + l,n], let z be a variable with domain (l,+oo), 
and set K t = (3a 2 2 21 + a2 l ). For every interval Vj we may bound the number of elements that have 
overflowed as follows: make a fixed charge of w — (1 — a)2 l — 1 elements plus the expected additional 
overflow given by: 

r/2'-iigrSl 

Yl w ( zX+1 - zX ) ■ Pr{|/i(S) n V^\ > a2 l + z x w} 

j=0 A=0 



^ — v , / Ki Ki r z — 1 



j=0 A=0 v ; 

The above inequality holds for any z > 1. Therefore, from lim^i 1 *L^--i = | it follows E(%2™ =1 Tu) < 
~[(w — (1 — a)2 l — 1 + g^-). The minimum of the upper bound is reached for w = max{v^, (1 — 
a)2' + l}. 

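We will focus on the case $\alpha > 0.8$. One of the simplifications that we make is to use $w = \sqrt[4]{K_l}$
for $0 \le l < l^* = \bigl\lfloor \lg \frac{\sqrt{3}\,\bar\alpha}{(1-\bar\alpha)^2} \bigr\rfloor$. It can be shown that $\sqrt[4]{K_l} - (1-\bar\alpha)2^l - 1 > 0$ for $0 \le l < l^*$. Unless $\alpha$
is very close to 1, the index $l^*$ is at most one less than the optimal splitting index in this case (often,
it is optimal). Then,

\[
\begin{aligned}
E\Bigl(\sum_{l=0}^{l^*-1} 2^l \sum_{i=1}^{n} T_{il}\Bigr)
&< r \sum_{l=0}^{l^*-1} \Bigl(\tfrac{4}{3}\sqrt[4]{K_l} - (1-\bar\alpha)2^l - 1\Bigr) \\
&= r \sum_{l=0}^{l^*-1} \tfrac{4}{3}\, 2^{l/2}\, \sqrt[4]{3\bar\alpha^2 + \frac{\bar\alpha}{2^l}} \;-\; r(1-\bar\alpha)(2^{l^*} - 1) \;-\; r \cdot l^* \\
&< r\,\tfrac{4}{3}\,\frac{\sqrt{\bar\alpha}\; 2^{l^*/2}}{\sqrt{2} - 1}\,\sqrt[4]{4.25} \;-\; r(1-\bar\alpha)2^{l^*} \;-\; 3.8r \qquad \text{(because } \alpha > 0.8\text{)} .
\end{aligned}
\]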



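For the second part of the sum, we have

\[
\begin{aligned}
E\Bigl(\sum_{l=l^*}^{\lg r-1} 2^l \sum_{i=1}^{n} T_{il}\Bigr)
&< r \sum_{l=l^*}^{\lg r-1} \frac{K_l}{3(1-\bar\alpha)^3 2^{3l}} \\
&= r \sum_{l=l^*}^{\lg r-1} \Bigl(\frac{\bar\alpha^2}{(1-\bar\alpha)^3 2^{l}} + \frac{\bar\alpha}{3(1-\bar\alpha)^3 2^{2l}}\Bigr) \\
&< \frac{r}{(1-\bar\alpha)^3}\Bigl(\frac{2\bar\alpha^2}{2^{l^*}} + \frac{4\bar\alpha}{9 \cdot 2^{2l^*}}\Bigr) .
\end{aligned}
\]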
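The last bound comes from summing two geometric series over $l \ge l^*$. As a sanity check, here is a small Python sketch using exact rational arithmetic; the helper names and the sample values $\bar\alpha = 9/10$, $l^* = 5$ are assumptions for illustration, and the common factor $r/(1-\bar\alpha)^3$ is left out. It compares a long partial sum of $\bar\alpha^2/2^l + \bar\alpha/(3 \cdot 2^{2l})$ with the closed form $2\bar\alpha^2/2^{l^*} + 4\bar\alpha/(9 \cdot 2^{2l^*})$.

```python
from fractions import Fraction


def tail_partial_sum(a: Fraction, l_star: int, terms: int = 60) -> Fraction:
    """Sum of a^2 / 2^l + a / (3 * 4^l) for l = l_star, ..., l_star + terms - 1."""
    return sum(
        a * a / Fraction(2) ** l + a / (3 * Fraction(4) ** l)
        for l in range(l_star, l_star + terms)
    )


def tail_closed_form(a: Fraction, l_star: int) -> Fraction:
    """Value of the same series summed over all l >= l_star."""
    return 2 * a * a / Fraction(2) ** l_star + 4 * a / (9 * Fraction(4) ** l_star)


# Hypothetical sample values (not from the paper).
a, l_star = Fraction(9, 10), 5
partial = tail_partial_sum(a, l_star)
closed = tail_closed_form(a, l_star)
print(float(partial), float(closed))  # the partial sum stays just below the closed form
```

Because every term is positive, any finite sum up to $\lg r - 1$ is strictly smaller than the infinite series, which is why the final relation is a strict inequality.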

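Merging the sums and substituting $2^{l^*}$ results in

\[
E\Bigl(\sum_{l=0}^{\lg r-1} 2^l \sum_{i=1}^{n} T_{il}\Bigr)
< \frac{r\bar\alpha}{1-\bar\alpha}\bigl(6.083 \cdot 2^{-t/2} - 1.731 \cdot 2^{-t} + 1.155 \cdot 2^{t}\bigr) - 3.8r + (1-\bar\alpha)\,r\,\frac{0.148 \cdot 2^{2t}}{\bar\alpha} ,
\]

where $t = \operatorname{frac}\bigl(\lg \frac{\sqrt{3}\,\bar\alpha}{(1-\bar\alpha)^2}\bigr)$. It can be shown that $6.083 \cdot 2^{-t/2} - 1.731 \cdot 2^{-t} + 1.155 \cdot 2^{t} < 6$. Replacing
$0.148 \cdot 2^{2t}\,\frac{1-\bar\alpha}{\bar\alpha}$ with its maximum value over $\bar\alpha \in [0.8, 1]$ yields the upper bound for this case.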

Performance for other values of $\alpha$ can be analyzed in a similar simple way, by suitably choosing
the indexes $l$ for which $w$ is set to $\sqrt[4]{K_l}$ (e.g. sometimes it will be for $l \in \{1, 2, 3\}$). When $\alpha$ is very close
to 0, the analysis for $\alpha = c$, with $c$ being a small constant, is applied. □
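The proof states without derivation that $6.083 \cdot 2^{-t/2} - 1.731 \cdot 2^{-t} + 1.155 \cdot 2^{t} < 6$ for the fractional part $t \in [0, 1)$. A quick numerical check on a fine grid (a sketch only; the function name and grid resolution are arbitrary choices, and a grid search is evidence rather than a proof) can be written in Python as:

```python
def merged_factor(t: float) -> float:
    """The t-dependent factor that multiplies r * alpha_bar / (1 - alpha_bar)."""
    return 6.083 * 2 ** (-t / 2) - 1.731 * 2 ** (-t) + 1.155 * 2 ** t


# Evaluate on a grid covering t in [0, 1], including both endpoints.
grid = [k / 10_000 for k in range(10_001)]
worst = max(merged_factor(t) for t in grid)
print(worst)  # stays below 6 on this grid
```

On this grid the largest value occurs at the right endpoint $t = 1$, so the constant 6 leaves some slack.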

Achieving a significantly smaller constant factor is planned for future work. A more elegant 
proof is required. 



6 The proof of Lemma 1 

Let $X_i$ be the indicator random variable that has value 1 iff $h(x_i) \in Q$, $1 \le i \le n$. By our
assumptions on $\mathcal{H}$, the variables $X_i$ are 4-wise independent and $\frac{q}{r}(1-\epsilon) \le \Pr\{X_i = 1\} \le \frac{q}{r}(1+\epsilon)$.
We use the 4th moment inequality:

\[
\Pr\{|X - \mu| \ge d\} \le \frac{E((X-\mu)^4)}{d^4}
\]

with $X = \sum_{i=1}^{n} X_i$, $\mu = E(X)$. In terms of raw moments, the 4th central moment is expressed as

\[
E((X-\mu)^4) = E(X^4) - 4\mu E(X^3) + 6\mu^2 E(X^2) - 3\mu^4 .
\]
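Both statements hold for any random variable with a finite fourth moment: the tail bound is Markov's inequality applied to $(X-\mu)^4$, and the second is the binomial expansion of $(X-\mu)^4$. As an illustration only, the Python sketch below checks them exactly on a binomial distribution; the parameters $n = 12$, $p = 1/3$ and all function names are assumptions chosen for the example, and a binomial variable is fully independent, which is stronger than the 4-wise independence assumed here.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n: int, p: Fraction) -> list:
    """Exact probabilities Pr{X = k} for k = 0..n."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def raw_moment(pmf: list, k: int) -> Fraction:
    """E(X^k) computed directly from the probability mass function."""
    return sum(prob * x ** k for x, prob in enumerate(pmf))


n, p = 12, Fraction(1, 3)  # hypothetical example parameters
pmf = binomial_pmf(n, p)
mu = raw_moment(pmf, 1)

# Fourth central moment, computed directly and via raw moments.
central4 = sum(prob * (x - mu) ** 4 for x, prob in enumerate(pmf))
from_raw = (raw_moment(pmf, 4) - 4 * mu * raw_moment(pmf, 3)
            + 6 * mu ** 2 * raw_moment(pmf, 2) - 3 * mu ** 4)


def tail(d: int) -> Fraction:
    """Pr{|X - mu| >= d}."""
    return sum(prob for x, prob in enumerate(pmf) if abs(x - mu) >= d)


print(central4 == from_raw)
print(all(tail(d) <= central4 / d ** 4 for d in range(1, n + 1)))
```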
We will expand the raw moments and express them in a form that will allow later cancellation of
high-order terms. The simplest is the second moment:

\[
\begin{aligned}
E(X^2) &= E\Bigl(\Bigl(\sum_{i=1}^{n} X_i\Bigr)^2\Bigr) = E\Bigl(\sum_{i} X_i^2 + \sum_{i \ne j} X_i X_j\Bigr) = E\Bigl(\sum_{i} X_i + \sum_{i \ne j} X_i X_j\Bigr) \\
&= E(X) + \sum_{i=1}^{n} \sum_{j \ne i} E(X_i X_j) = E(X) + \sum_{i=1}^{n} E(X_i) \sum_{j \ne i} E(X_j) \\
&= E(X) + \sum_{i=1}^{n} E(X_i)\bigl(E(X) - E(X_i)\bigr) = E(X) + E(X)^2 - \sum_{i} (E(X_i))^2 .
\end{aligned}
\]

The equality $X_i^2 = X_i$ is true because $X_i$ is an indicator variable. Also, $E(X_i X_j) = E(X_i)E(X_j)$,
$i \ne j$, because $X_i$ and $X_j$ are independent. Defining $\sigma_k = \sum_i (E(X_i))^k$, the identity is written more
succinctly as: $E(X^2) = \mu + \mu^2 - \sigma_2$.
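The identity $E(X^2) = \mu + \mu^2 - \sigma_2$ can be confirmed by brute force on a small example. The Python sketch below is illustrative only: the probabilities are arbitrary, the function names are hypothetical, and the indicators are taken to be fully independent, which is a special case of the pairwise independence the identity needs. It enumerates all outcomes with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product


def exact_moment(probs: list, k: int) -> Fraction:
    """E(X^k) for X = sum of independent indicators with Pr{X_i = 1} = probs[i]."""
    total = Fraction(0)
    for outcome in product((0, 1), repeat=len(probs)):
        weight = Fraction(1)
        for bit, p in zip(outcome, probs):
            weight *= p if bit else 1 - p
        total += weight * sum(outcome) ** k
    return total


def power_sum(probs: list, k: int) -> Fraction:
    """sigma_k = sum_i (E(X_i))^k."""
    return sum(p ** k for p in probs)


# Arbitrary example probabilities (not from the paper).
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 5), Fraction(2, 7)]
mu = sum(probs)
second = exact_moment(probs, 2)
print(second == mu + mu ** 2 - power_sum(probs, 2))
```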

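Define the predicate $A$ such that $A(a_1, a_2, \ldots, a_k)$ is true iff $a_1, \ldots, a_k$ are all distinct. For the
third moment we have, using independence of any three indicator variables:

\[
\begin{aligned}
E(X^3) &= E\Bigl(\Bigl(\sum_{i=1}^{n} X_i\Bigr)^3\Bigr) = E\Bigl(\sum_{i} X_i^3 + \sum_{i \ne j} 3X_i^2 X_j + \sum_{A(i,j,k)} X_i X_j X_k\Bigr) \\
&= E\Bigl(\sum_{i} X_i + \sum_{i \ne j} 3X_i X_j + \sum_{A(i,j,k)} X_i X_j X_k\Bigr) \\
&= \mu + 3\mu^2 - 3\sigma_2 + \sum_{i=1}^{n} E(X_i) \sum_{j \ne i} E(X_j) \sum_{k:\,A(i,j,k)} E(X_k) \\
&= \mu + 3\mu^2 - 3\sigma_2 + \sum_{i} E(X_i) \sum_{j \ne i} E(X_j)\bigl(E(X) - E(X_j) - E(X_i)\bigr) \\
&= \mu + 3\mu^2 - 3\sigma_2 + \sum_{i} E(X_i)\Bigl((\mu - E(X_i))^2 - \sum_{j \ne i} E(X_j)^2\Bigr) \\
&= \mu + 3\mu^2 - 3\sigma_2 + \mu^3 - 3\mu\sigma_2 + 2\sigma_3 .
\end{aligned}
\]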

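Expanding $E(X^4)$ in the same way, using 4-wise independence (and reusing some of the previous
calculations) yields:

\[
\begin{aligned}
E\Bigl(\Bigl(\sum_{i=1}^{n} X_i\Bigr)^4\Bigr) &= E\Bigl(\sum_{i} X_i^4 + \sum_{i \ne j} 4X_i^3 X_j + \sum_{i \ne j} 3X_i^2 X_j^2 + \sum_{A(i,j,k)} 6X_i^2 X_j X_k + \sum_{A(i,j,k,l)} X_i X_j X_k X_l\Bigr) \\
&= \mu + 7\mu^2 - 7\sigma_2 + 6\mu^3 - 18\mu\sigma_2 + 12\sigma_3 + \mu^4 - 6\mu^2\sigma_2 + 8\mu\sigma_3 + 3(\sigma_2)^2 - 6\sigma_4 .
\end{aligned}
\]

Combining all the identities results in:

\[
E((X-\mu)^4) = \mu + 3\mu^2 - 7\sigma_2 - 6\mu\sigma_2 + 12\sigma_3 + 3(\sigma_2)^2 - 6\sigma_4 .
\]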
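The closed forms for the third and fourth raw moments and for the fourth central moment involve many cancellations, so an independent check is useful. The sketch below is an illustration with arbitrary probabilities and hypothetical function names; it uses fully independent indicators, a special case of 4-wise independence. It builds the exact distribution of $X$ by dynamic programming and compares each moment against its expression in $\mu$ and $\sigma_k = \sum_i (E(X_i))^k$.

```python
from fractions import Fraction


def sum_pmf(probs: list) -> list:
    """Exact distribution of a sum of independent indicators, one variable at a time."""
    pmf = [Fraction(1)]
    for p in probs:
        nxt = [Fraction(0)] * (len(pmf) + 1)
        for x, weight in enumerate(pmf):
            nxt[x] += weight * (1 - p)   # indicator is 0
            nxt[x + 1] += weight * p     # indicator is 1
        pmf = nxt
    return pmf


def moment(pmf: list, k: int, center: Fraction = Fraction(0)) -> Fraction:
    """E((X - center)^k)."""
    return sum(weight * (x - center) ** k for x, weight in enumerate(pmf))


# Arbitrary example probabilities (not from the paper).
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 5), Fraction(2, 7), Fraction(3, 11)]
pmf = sum_pmf(probs)
mu = sum(probs)
s2, s3, s4 = (sum(p ** k for p in probs) for k in (2, 3, 4))

third = mu + 3 * mu ** 2 - 3 * s2 + mu ** 3 - 3 * mu * s2 + 2 * s3
fourth = (mu + 7 * mu ** 2 - 7 * s2 + 6 * mu ** 3 - 18 * mu * s2 + 12 * s3
          + mu ** 4 - 6 * mu ** 2 * s2 + 8 * mu * s3 + 3 * s2 ** 2 - 6 * s4)
central4 = mu + 3 * mu ** 2 - 7 * s2 - 6 * mu * s2 + 12 * s3 + 3 * s2 ** 2 - 6 * s4

print(moment(pmf, 3) == third, moment(pmf, 4) == fourth, moment(pmf, 4, mu) == central4)
```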
Because of $E(X_i) \le 1$, it follows that $\mu \ge \sigma_2 \ge \sigma_3$. Thus $-3\mu\sigma_2 + 3(\sigma_2)^2 \le 0$, and therefore

\[
E((X-\mu)^4) \le \mu + 3\mu^2 - 3\mu\sigma_2 + 12\sigma_3 \le \mu + 3\mu^2 - \sigma_2\Bigl(3\alpha q(1-\epsilon) - \frac{12q}{r}(1+\epsilon)\Bigr) \le \mu + 3\mu^2 .
\]

The second inequality uses $\mu \ge \alpha q(1-\epsilon)$ and $\sigma_3 \le \frac{q}{r}(1+\epsilon)\,\sigma_2$. The last inequality is true due to the upper bound on $\epsilon$ required in the statement of the lemma.
Noticing that $3\mu^2 + \mu \le 3\alpha^2 q^2 (1+\epsilon)^2 + \alpha q(1+\epsilon)$ finishes the first part of the lemma.

For the second part, observe that the subfamily $\{h \in \mathcal{H} \mid h(x) = a\}$, for a fixed $a$, is 4-wise
independent and $\epsilon$-approximately uniform. Using the result of the first part by conditioning on the
event $h(x) = a$, and then applying the total probability theorem, yields the claimed inequality.